import matplotlib.pyplot as plt
from datetime import datetime
import cvxpy as cvx
import pandas as pd
import numpy as np
import fix_yahoo_finance as yf
import scipy.stats as stats
import pylab
from statsmodels import multivariate
import statsmodels.api as sm
from statsmodels.graphics.tsaplots import plot_pacf, plot_acf
from statsmodels.tsa.stattools import adfuller
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.decomposition import PCA
import seaborn as sns
sns.reset_orig()
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.height', 12)
pd.set_option('display.max_rows', 12)
(Please update PATH accordingly before running.)
PATH = u'/Users/Norman/Documents/GitHub/Invesco/Q2/'
## read intermediate output from pickled files if True, skipping intermediate calculations
## run and pickle intermediate output to PATH if False
read_from_exiting_output = True
The problem is to investigate a single stock $i$'s contemporaneous return $r_{i,t}$ at time $t$. My stock of interest is Nvidia, and I investigate its daily returns. The sample spans 2015-04-30 to 2018-04-30.
I modeled the daily return of Nvidia as a combination of systematic return and idiosyncratic return using a top-down approach: starting from the market drivers and narrowing down to individual factors. The systematic piece consists of a two-stage multi-factor regression and a low-rank approximation of latent factors using Robust PCA. From the two-stage framework, it is observed that Market together with Value, Momentum and Industry (technology) factors can explain part of the systematic variation of the returns. For Robust PCA, however, due to its limitations the low-rank matrix could not be retrieved.
The idiosyncratic component can be captured by the sparse matrix output of Robust PCA. I show that the timing of large isolated shocks is captured by Nvidia's earnings surprises.
In sum, the model consists of a two-stage multifactor framework used to explain daily returns, with the addition of historical earnings surprises for quarter-end shocks.
The systematic piece will be largely extracted using multi-factor models. The factor pool contains factors that are well known and broadly used. They are:
Market: a factor with a long history, a standard component of almost every factor model
Industry (technology): an industry portfolio pools companies with similar corporate structures and business foundations; it can capture variation shared by industry peers but not the broad market
Quality: a somewhat controversial factor, because there is no consensus among academics or practitioners that it is a real risk factor. However, stocks loading on this factor have good cash-flow valuations and quality accounting reports; it is widely used in practice
Value: a factor popularized by Fama-French, where the risk premium is earned by holding value stocks. This factor is easy to understand and has held up empirically both before and after academic publication
Size: a Fama-French factor, where the risk premium is earned by holding small caps. Although past research has found that the factor almost disappeared after publication, it can serve as an easy proxy for liquidity.
Momentum: one of the most thoroughly investigated factors, where the risk premium compensates investors for bearing large crash risk (negative skewness).
Minimum Volatility: a popular factor that captures the low-risk anomaly by holding defensive stocks; it tends to outperform during market downturns
After checking for stationarity and for multicollinearity using VIF, a series of linear regressions is performed to retrieve the systematic returns. The raw return of the target stock is first regressed against Market returns. Its residuals are then regressed against a collection of Style factors and the Industry return. The systematic piece is then processed further to retrieve a low-rank approximation of latent factors.
The low-rank approximation is carried out using Robust PCA. In particular, the above two-stage regression is performed for all $n$ stocks within the tech sector to get the residual time series over the $t$-day horizon, i.e. a t-by-n matrix of two-stage residuals. Robust PCA is then applied to this matrix to (hopefully) obtain a low-rank approximation of latent factors (t-by-n) and a sparse matrix (t-by-n) containing the idiosyncratic returns/shocks of the $n$ stocks.
The calculations are processed carefully: only information time-stamped at or before time $t$ is used, systematically avoiding any forward-looking bias.
Source:
yf.pdr_override()
start_date = '2015-04-30'
end_date = '2018-04-30' # 2018-01-01
all_tech_tickers = list(pd.read_csv(PATH + 'tech_tickers.csv')['Ticker symbol'].values)
ticker_subset = ['AAPL','GOOGL','MSFT','FB','ORCL','INTC','CSCO'
,'IBM','TXN','AVGO','NVDA','ADBE','CRM','ADP','QCOM']
factor_etf = ['QUAL','IVE','USMV','MTUM','SIZE','^GSPC']
# data download
if read_from_exiting_output:
    ## read previously pickled data from PATH
    ticker_data = pd.read_pickle(PATH+'ticker_data')
    factor_data = pd.read_pickle(PATH+'factor_data')
else:
    ## download fresh data and pickle it to PATH
    ticker_data = yf.download(all_tech_tickers,start_date,end_date,as_panel=True)
    factor_data = yf.download(factor_etf,start_date,end_date,as_panel=True)
    pd.to_pickle(ticker_data, PATH+'ticker_data')
    pd.to_pickle(factor_data, PATH+'factor_data')
## industry return from fama french website, taking the average industry returns
## Only Technology return is used
headers = pd.read_csv(PATH+'10_Industry_Portfolios_Daily.CSV',
skiprows=9,header=0,index_col=0,nrows=0)
ind_ret = pd.read_csv(PATH+'10_Industry_Portfolios_Daily.CSV',
skiprows=24235,header=0,index_col=0,
parse_dates=True,skipfooter=1)
ind_ret.columns = headers.columns
ind_ret = ind_ret[start_date:end_date]
ind_ret = ind_ret[1:]/100 # drop the first row to match the return horizon; convert percent to decimal
## Calculating daily log returns
adj_cls_ticker = ticker_data['Adj Close']
adj_cls_factor = factor_data['Adj Close']
daily_log_ret_ticker = adj_cls_ticker.apply(np.log).diff()
daily_log_ret_factor = adj_cls_factor.apply(np.log).diff()
daily_log_ret_factor.columns = ['Value','Momentum','Quality','Size','Min Vol','Market']
The functions below correct missing values in the time series, print dataframe summary statistics, and winsorize the data if necessary.
def preprocessing(df):
"""
Remove columns whose entries are all nans or zeros; fill the remaining nans with zeros
Returns a processed dataframe
"""
df = df.dropna(axis=1,how='all') # drop columns with only nans
df = df.loc[:, (df != 0).any(axis=0)] # drop columns with only zeros
df = df.dropna(axis=0,how='all') # drop rows with only nans
df = df.loc[(df != 0).any(axis=1),:] # drop rows with only zeros
df = df.fillna(0.0)
return df
def data_validation(df, outlier_thres = 3, is_return = True):
"""
Print dataframe summary and print outliers summary if df is asset return
"""
print 'DataFrame shape:',df.shape
# print 'DataFrame unique datatypes:', df.dtypes.unique()
print 'Number of NaNs: %d' % df.isnull().values.sum()
print 'Number of Infs: %d' % np.sum(np.isinf(df.values))
df_summary = df.describe(percentiles=[0.01,0.99]).T
print '\nDataFrame summary:'
print df_summary
if is_return:
print '\nOutliers:'
outlier = df_summary.loc[(df_summary['max']>outlier_thres) |
(df_summary['min']<-outlier_thres),:]
print outlier
return
def winsorize_series(series, multiple = 1):
"""
Trim data series to multiple standard deviation away from mean
Returns a winsorized series
"""
lo = series.mean() - multiple*series.std()
hi = series.mean() + multiple*series.std()
series[series>hi]=hi
series[series<lo]=lo
return series
def winsorize_df(df):
"""
Wrapper to trim dataframe by column
Returns a trimmed dataframe
"""
return df.apply(winsorize_series, axis=0) # column-wise, matching the docstring
## preprocessing, removing NaN and zeros where appropriate
daily_log_ret_ticker = preprocessing(daily_log_ret_ticker)
daily_log_ret_factor = preprocessing(daily_log_ret_factor)
ind_ret = preprocessing(ind_ret)
## visualization, looking for bad data and outliers
data_validation(daily_log_ret_ticker)
data_validation(daily_log_ret_factor)
data_validation(ind_ret)
Comment: All data is cleaned and no outlier stands out. Proceed further with model implementation.
The section below contains functions and routines to perform linear regression. It contains functions to check for stationarity and to calculate VIF of exogenous variables. It also contains a wrapper function to perform linear regression, with options to print regression summary statistics and residual plots.
def stationarity_test(data_series, critical_level = 0.05, print_result=False):
"""
Test if data is stationary using Augmented Dickey Fuller
"""
if isinstance(data_series, pd.Series) or isinstance(data_series, pd.DataFrame):
data_series = data_series.values
result = adfuller(data_series)
adf_stat, p_value = result[0], result[1]
    if p_value > critical_level:
        raise ValueError('Cannot reject the null hypothesis of a unit root (non-stationarity) at the %.2f level' % critical_level)
if print_result:
print('ADF Statistic: %f' % adf_stat)
print('p-value: %f' % p_value)
def calculate_vif(X):
"""
Calculate VIF stats of given exogenous variable X (multi-dimensional)
Returns series of VIF stats by indexed by variable name
"""
vif = [variance_inflation_factor(X.values, ix) for ix in range(X.shape[1])]
if isinstance(X, pd.DataFrame):
vif = pd.Series(vif,index=X.columns)
return vif
def lin_reg(Y, X, add_const=True, print_plots=True, print_summary=True):
    """
    Performs linear regression, prints summary stats and residual plots
    Returns regression result object
    """
    # dimension check
    assert(len(Y) == len(X))
    # add constant to X if necessary
    if add_const:
        X = sm.add_constant(X)
    # run OLS with heteroskedasticity-robust (HC0) errors and print results
    model = sm.OLS(Y, X, hasconst=True)
    res = model.fit(cov_type='HC0')  # alternatively 'HAC', cov_kwds={'maxlags':1}
    if print_summary: print(res.summary())
    if print_plots:
        # remove constant column for Seaborn plotting
        if add_const:
            X = X.iloc[:, 1:]
# set up figure
fig, ax = plt.subplots(2,2,figsize=(10,10))
combined_X = (res.params*X).dropna(axis=1).sum(axis=1)
# Y vs combined X
sns.regplot(combined_X,Y,ax = ax[0,0])
ax[0,0].set_xlabel('Combined X')
ax[0,0].set_ylabel('Y')
ax[0,0].set_title('Regression Plot')
# residual scatter plot
ax[0,1].scatter(range(len(res.resid)), res.resid)
ax[0,1].set_title('Regression Residuals')
# residual acf
plot_acf(res.resid,ax=ax[1,0],lags=15)
ax[1,0].set_title('Residual ACF')
# residual pacf
plot_pacf(res.resid,ax=ax[1,1],lags=15)
ax[1,1].set_title('Residual PACF')
ax[1,1].set_ylim([-1.15,1.15])
return res
## stationarity check
combined_ts_series = pd.concat([daily_log_ret_ticker,daily_log_ret_factor,ind_ret],axis=1)
for col in combined_ts_series:
stationarity_test(combined_ts_series[col])
Comment: The routine above didn't raise any error, meaning all the input series are stationary.
## multicollinearity check
all_factor = pd.concat([daily_log_ret_factor, ind_ret['HiTec']],axis=1)
calculate_vif(all_factor)
Comment: From the VIF test, it is clear that there exists collinearity, especially in the Market factor. This might be because all the style factors are proxied by long-only ETFs, which embed market exposure. So I decided to first perform a standard CAPM regression; its residuals are used as the endogenous variable against Style factors and Industry returns in the second-stage regression.
stock_of_interest = 'NVDA'
single_stock_ret = daily_log_ret_ticker[stock_of_interest]
## CAPM
mkt_ret = daily_log_ret_factor['Market'].to_frame()
single_capm_res = lin_reg(Y=single_stock_ret, X=mkt_ret,
print_plots=True, print_summary=True)
Comment: not surprisingly, both the intercept and the beta coefficient from CAPM are significant. The residual does exhibit some heteroskedasticity, which is normal for financial time series. There is little serial correlation in the residuals, however, which is a good sign.
CAPM only explains around 21% of the variation - not powerful, but it's a good start.
style_factor = ['Quality','Value','Min Vol','Momentum','Size']
style_ind = pd.concat([daily_log_ret_factor[style_factor],
ind_ret['HiTec']],axis=1).dropna()
calculate_vif(style_ind)
Comment: the VIF output shows collinearity in Quality. Quality emphasizes a firm's profitability and earnings sustainability, so it partially resembles Value, since a comprehensive measure of Value incorporates a firm's earnings potential as well. At the same time, Quality can be defensive, because it pools stocks with steady earnings during market downturns; in this regard it correlates with Min Vol. I therefore excluded Quality from the regression and proceeded with the rest as exogenous variables.
style_ind_ex_qual = style_ind[style_ind.columns[style_ind.columns!='Quality']]
calculate_vif(style_ind_ex_qual)
## style + industry
single_style_ind_res = lin_reg(Y=single_capm_res.resid,
X=style_ind_ex_qual,
print_plots=True,
print_summary=True)
Comment: the Style-industry regression is able to squeeze out around 17% of the variation from the residuals. The intercept is insignificantly different from zero, as it is mostly stripped out in the first-stage regression. Regarding the beta coefficients, Min Vol and Size are insignificant. This is expected: the Size ETF holds only long positions in small-cap names while Nvidia is a large-cap stock, and Nvidia is often classified as a growth stock, quite the opposite of the defensive stocks that characterize Min Vol.
The rest of the style factors all exhibit significant coefficients, which is consistent with Nvidia being a growth stock in the tech sector.
The plot below shows the residual time series from the style-industry regression. There is still some variation in the return series, most noticeably the spikes that appear from time to time. To further separate returns into systematic vs idiosyncratic, a Robust PCA is performed to decompose the residuals into a latent, low-rank component $Z$ and a sparse "shocks" component $S$.
single_style_ind_res.resid.plot(title='NVDA style-industry regression residual',
figsize=(12,6))
The Robust PCA can be formulated as follows:
$$\min_{Z, S} \|Z\|_* + \lambda\|S\|_1 \quad \text{subject to } A = Z + S$$
where $\|\cdot\|_*$ is the nuclear norm (i.e. the sum of the singular values), a convex surrogate for the rank.
The application here is to decompose a matrix $A$ containing $t$ periods of returns for $n$ securities into a latent, low-rank component $Z$, which can be thought of as latent factors that explain returns, and a sparse component $S$, which is the idiosyncratic piece.
In our case, $A$ is a t-by-n matrix of residual time series over the $t$-day horizon for all $n$ tech stocks, obtained from the above two-stage regressions. In an ideal case, the resulting $Z$ will have $\mathbf{rank}(Z) \ll n$, representing $\mathbf{rank}(Z)$ latent factors, and $S$ will be sparse, representing the idiosyncratic component.
A selected list of $\lambda$ will be used for tuning. Each $\lambda$ corresponds to one set of $Z$ and $S$.
# CAPM regression for all tech stocks
capm_result = {}
capm_resid = {}
for ticker in all_tech_tickers:
capm_result[ticker] = lin_reg(Y=daily_log_ret_ticker[ticker],
X=mkt_ret,
print_plots=False,
print_summary=False)
capm_resid[ticker] = capm_result[ticker].resid
capm_resid_df = pd.DataFrame.from_dict(capm_resid)
# style factor regression for all tech stocks
style_result = {}
style_resid = {}
for ticker in all_tech_tickers:
style_result[ticker] = lin_reg(Y=capm_resid_df[ticker],
X=style_ind_ex_qual,
print_plots=False,
print_summary=False)
style_resid[ticker] = style_result[ticker].resid
style_resid_df = pd.DataFrame.from_dict(style_resid)
def robustPCA(A, lamb):
"""
Performs robust PCA on a real n x p matrix A.
The argument lamb > 0 represents penalty on l_1 norm.
Returns low-rank matrix and sparse matrix
"""
S = cvx.Variable(*A.shape)
objective = cvx.Minimize(cvx.norm(A - S, "nuc") + lamb*cvx.norm(S, 1))
prob = cvx.Problem(objective)
prob.solve()
Z = A - S.value
return Z, S.value
def approxRank(Z):
"""
Count number of sv > 1e-4*sigma_max for rank approximation.
Returns an int of approximated rank
"""
sv = np.linalg.svd(Z.values, compute_uv=False)
return (sv > 1e-4*max(sv)).sum()
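As a quick sanity check of the thresholding rule, the same rule applied to a synthetic matrix of known rank (illustration only, not project data):

```python
import numpy as np

def approx_rank(M, tol=1e-4):
    # same thresholding rule as approxRank above, on a plain array
    sv = np.linalg.svd(M, compute_uv=False)
    return int((sv > tol*sv.max()).sum())

# build an exactly rank-2 100x20 matrix from two outer products
rng = np.random.RandomState(4)
M = (np.outer(rng.randn(100), rng.randn(20)) +
     np.outer(rng.randn(100), rng.randn(20)))
# approx_rank(M) recovers 2, since all remaining singular values
# are at floating-point noise level relative to the largest one
```

The relative threshold matters: an absolute cutoff would misbehave when the residual matrix is rescaled.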
## Training
lamb_list = [0.1, 0.05, 0.01,0.005]
if not read_from_exiting_output:
Z_output={}
S_output={}
for lamb in lamb_list:
Z, S = robustPCA(style_resid_df.as_matrix(), lamb)
Z_output[lamb]=pd.DataFrame(Z,index=style_resid_df.index,columns=style_resid_df.columns)
S_output[lamb]=pd.DataFrame(S,index=style_resid_df.index,columns=style_resid_df.columns)
## pickle trained file to local
if not read_from_exiting_output:
pd.to_pickle(Z_output, PATH+'Z_output')
pd.to_pickle(S_output, PATH+'S_output')
else:
## read trained pickle file from local
Z_output = pd.read_pickle(PATH+'Z_output')
S_output = pd.read_pickle(PATH+'S_output')
Sections below show the results of Robust PCA for a subset of all tech companies. First come sets of charts, one set for each $\lambda$. The charts include the raw residual returns, the low-rank (latent factor) returns, and the sparse shock returns.
In an ideal scenario, $\mathbf{rank}(Z) \ll n$ and $S$ is sparse.
plt.style.use('ggplot')
for idx, lamb in enumerate(lamb_list):
print('\n\n')
print 'Set of charts for lambda = %.3f' % lamb
plt.figure(figsize=(32, 30))
# raw daily return series
    plt.subplot(3, 1, 1)
    for company in ticker_subset:
        plt.plot(style_resid_df[company], label=company, linewidth=.5)
    plt.title('Raw Returns')
    plt.legend(loc='upper left', ncol=4)
    # low rank returns
    plt.subplot(3, 1, 2)
    for company in ticker_subset:
        plt.plot(style_resid_df.index, Z_output[lamb][company], linewidth=.5, label=company)
    plt.title(r'Latent Factors, $\mathbf{rank}(Z) \approx %i$' % approxRank(Z_output[lamb]))
    plt.legend(loc='upper left', ncol=4)
    # sparse returns (only three panels are drawn, so a 3-row grid suffices)
    plt.subplot(3, 1, 3)
S = np.array(S_output[lamb])
grid = np.arange(S.shape[0])
for col_num, company in enumerate(style_resid_df[ticker_subset]):
plt.bar(grid, S[:, col_num], label=company)
plt.title(r'Shocks Component $\lambda = %.3f$' % lamb)
plt.legend(loc='upper left', ncol=4)
plt.show()
Comment: It appears that $\mathbf{rank}(Z)$ is not reduced at all. This is in fact a known drawback of using Robust PCA for low-rank approximation, as $\|\cdot\|_*$ is only a convex surrogate for the rank. However, judging from the charts, $Z$ does capture much of the co-variation of the selected tech stocks, as most of the time series in the middle plot oscillate within a tight band. This can be qualitatively classified as 'systematic' variation within the sector.
Luckily the sparse component does appear to be quite sparse, even when $\lambda$ is low. I eventually chose $\lambda = 0.05$ to strike a balance between the sparsity of $S$ and the amount of variation left in $Z$.
Below I retrieve the sparse component of Nvidia from $S$ in an attempt to look for patterns. Noticeably, the spikes appear almost every 3 months, which seems to coincide with Nvidia's earnings release dates. I sorted the sparse returns in descending order and observed that the largest returns appear repeatedly in May, August and November.
A positive spike could indicate an earnings surprise at announcement, while a negative spike a disappointment. I extracted the earnings surprise data from Zacks.com and aligned it with the sparse return. Below I investigate this hypothesis further.
lamb = 0.05
latent_factor_ret = Z_output[lamb]
single_stock_lret = latent_factor_ret[stock_of_interest]
sparse_ret = S_output[lamb]
single_stock_sret = sparse_ret[stock_of_interest]
single_stock_sret[abs(single_stock_sret)< 1e-4] = 0
single_stock_sret.plot(lw=1,figsize=(15,7),
title='Sparse Return of %s with $\lambda$ = %.3f'
% (stock_of_interest,lamb))
sorted_series = single_stock_sret[single_stock_sret!=0].sort_values(ascending=False)
pd.DataFrame(sorted_series.values,
index = sorted_series.index,
columns=['Sparse return']).tail(10)
earning_surprise = pd.read_excel(PATH+'nvda_earning_surprise.xlsx',index_col=0,parse_dates=True)
sparse_vs_surprise = single_stock_sret.to_frame('Sparse')
sparse_vs_surprise['Surprise'] = earning_surprise['Surprise']
sparse_vs_surprise.fillna(0,inplace=True)
surprised_date = sparse_vs_surprise.Surprise[sparse_vs_surprise.Surprise!=0].index
sparse_date = sparse_vs_surprise.Sparse[abs(sparse_vs_surprise.Sparse)>0.05].index
hit_ratio = float(len(list(set(surprised_date) & set(sparse_date))))/len(sparse_date)
print 'Hit ratio of earning surprises coinciding with sparse returns whose absolute magnitude is greater than 0.05: %.2f' % hit_ratio
plt.figure(figsize=(12,6))
ax1 = sparse_vs_surprise.Sparse.plot(lw=1)
ax2 = sparse_vs_surprise.Surprise.plot(secondary_y=True,lw=2,style='--')
ax1.set_ylabel('sparse')
ax1.legend(['sparse'],loc=2)
ax2.set_ylabel('surprise')
ax2.legend(['surprise'],loc=1)
ax2.set_ylim(-1,2.5)
plt.title('NVDA sparse return vs earning surprises')
Comment: Almost all earnings spikes coincide with large sparse-return spikes (absolute magnitude greater than 0.05), with a hit ratio of 80%. While not capturing all the shocks, this can serve as a reasonable timing approximation. Note also that the level and variation of the spikes are not fully captured by the surprises. This could be because the earnings surprise of a single firm is not comprehensive enough to capture the correlation with sector or market earnings surprises. Also, under different economic regimes, the surprise factor might be suppressed or magnified.
Nevertheless, it does not hurt to run a regression against the earnings surprises. The earnings surprise series is similar to a binary variable: zero everywhere except on earnings announcement dates. Below is the regression output. The result shows that the beta coefficient is significant, with an R-squared of 16.5%. This is a reasonable approximation for the shocks, considering the variable is almost entirely zero.
lin_reg(Y=single_style_ind_res.resid, X=sparse_vs_surprise['Surprise'], print_plots=True, print_summary=True)
The problem is to investigate a single stock $i$'s contemporaneous return $r_{i,t}$ at time $t$. I modeled the daily return of Nvidia as a combination of systematic and idiosyncratic return in a top-down framework. The systematic piece is modeled with a multi-factor framework consisting of two stages of regression: the first stage is the standard CAPM; its residuals are regressed against style factors and industry returns in the second stage. The factor pool contains the following factors: Value, Momentum, Quality, Size, Min Vol and Market. From the two-stage framework, it is observed that Market together with Value, Momentum and Industry can explain some of the systematic variation of the return (with R-squareds of 21% and 17%, respectively, in each regression).
In an attempt to further investigate the systematic component, I performed the same two-stage regression on all tech stocks, retrieved the residuals and applied Robust PCA, with the hope of finding a low-rank approximation of latent factors. However, due to limitations of Robust PCA, the low-rank matrix could not be retrieved successfully.
Nevertheless, the other output of Robust PCA provided some insight into the sparse returns, or shocks. Most of the dominant shocks coincided with earnings surprises. I therefore ran another regression of Nvidia's two-stage residuals against earnings surprises. The result showed that earnings surprises have some explanatory power. It is not perfect, of course, as spikes on earnings announcements are also correlated with peers' earnings and the overall market and economic dynamics.
In sum, to model the contemporaneous return on a daily horizon, I would use the two-stage multifactor framework discussed above and, on earnings days, factor in the surprise from the earnings release.
In the sections below I explore and evaluate possible extensions of this research project in the areas listed in Section 5, Future Steps. I look at different ways to summarize the relationship between the single-stock return of Nvidia and the pool of factors. Specifically, I use rolling-window regression and a Kalman filter as alternatives to OLS in the two-stage regression, and PCA to try to force rank reduction as opposed to Robust PCA. Additionally, I present a possible improvement in explaining daily return shocks related to earnings surprises.
The factor pool consists of the same factors as the initial draft: Value, Momentum, Quality, Size, Min Vol, and Market. Value, Momentum, Size, and Market are sourced from the Fama-French database; among them, the style factors are effectively long-short portfolios, and the Market covers a wider universe. Quality and Min Vol still use the corresponding ETFs as proxies, as there is no easily accessible alternative data source.
It is natural to think that investors are more optimistic during expansions and therefore react to shocks more optimistically; the opposite holds in contractions. This also fits the rationale of risk premia: in good times investors demand less premium to hold a stock, so the present price edges higher, and in bad times the reverse holds. A good metric to assess the economic state and business cycle is the Leading Index for the United States, a monthly leading index that models the six-month growth rate of the state's coincident index as well as other variables that lead the economy, such as state-level housing permits. This should capture investors' expectations of near-term economic growth.
I plan to categorize the monthly economic environment into three states: slowdown, neutral and expansion. To determine the state of a particular month, I first calculate a rolling mean $\mu_{12}$ and standard deviation $\sigma_{12}$ over the past 12 months. If the current index level is greater than $\mu_{12}+\sigma_{12}$, the month is classified as 'expansion'; if it is below $\mu_{12}-\sigma_{12}$, as 'slowdown'; otherwise it is treated as neutral. Both expansion and slowdown are added to the regression against earnings surprises as binary variables with synced time stamps. Hopefully the regression result will show significant improvement.
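The three-state rule can be sketched as follows (on a hypothetical random-walk series standing in for USSLIND; the 'undefined' label for months without a full window is my own convention):

```python
import numpy as np
import pandas as pd

# hypothetical monthly index standing in for the real USSLIND series
rng = np.random.RandomState(1)
idx = pd.Series(np.cumsum(rng.normal(0, 0.1, 60)),
                index=pd.date_range('2013-01-01', periods=60, freq='MS'))

mu12 = idx.rolling(12).mean()   # trailing 12-month mean
sd12 = idx.rolling(12).std()    # trailing 12-month std

state = pd.Series('neutral', index=idx.index)
state[idx > mu12 + sd12] = 'expansion'
state[idx < mu12 - sd12] = 'slowdown'
state[mu12.isnull()] = 'undefined'  # first 11 months lack a full window
```

The trailing window uses only past observations, consistent with the no-forward-looking-bias rule stated earlier.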
The data source for the Leading Index for the United States is the St. Louis Fed website.
usslind = pd.read_csv(PATH + 'USSLIND.csv',index_col=0,parse_dates=True)
above_std = (usslind - usslind.rolling(12).mean()-1*usslind.rolling(12).std()).resample('QS-FEB').first().tail(12)
above_std = (above_std > 0)*1
below_std = (usslind - usslind.rolling(12).mean()+1*usslind.rolling(12).std()).resample('QS-FEB').first().tail(12)
below_std = (below_std < 0)*1
## time stamp sync
expansion = pd.Series(np.squeeze((above_std.fillna(0)).values),
index = earning_surprise.sort_index().index[2:-1],
name='expansion')
slowdown = pd.Series(np.squeeze((below_std.fillna(0)).values),
index = earning_surprise.sort_index().index[2:-1],
name='slowdown')
surprise_econ_state = sparse_vs_surprise.Surprise.to_frame().copy()
surprise_econ_state['expansion'] = expansion
surprise_econ_state['slowdown'] = slowdown
surprise_econ_state.fillna(0,inplace=True)
res = lin_reg(Y=single_style_ind_res.resid, X=surprise_econ_state, print_plots=True, print_summary=True)
The result overall looks promising. The R-squared almost doubled, from 16.6% to 28.9%. The 'expansion' variable is strongly significant, but 'slowdown' lacks significance. The theory seems to hold some ground here, except that the economic-slowdown variable does not correct for negative shocks very well.
Further digging suggests this approach has one issue that should not be overlooked: the sample sizes for 'expansion' and 'slowdown' are too small. Specifically, using a 12-month window, the sample has only one 'expansion' in a month of earnings release and three 'slowdowns' (see the dataframe below). It just so happens that the only 'expansion' month coincides with the largest positive shock, which explains the huge improvement in R-squared. This could be pure coincidence or data mining. A longer history is needed to properly examine this theory.
It seems the problem remains that the level is unmatched by the earnings surprise. There could be a few reasons:
Some specific events happen at specific times and cannot be easily modelled with a purely systematic approach. For example, many believe the sharp 4% drop on 2/23/2017 was mainly caused by two downgrades, from BMO Capital and Instinet. Such events cannot be handled easily by a systematic framework.
## expansion vs slowdown binary indicator
pd.concat([expansion,slowdown],axis=1)
res_to_plot = pd.concat([(surprise_econ_state*res.params).dropna(axis=1).sum(axis=1),
sparse_vs_surprise.Sparse.to_frame()],axis=1)
plt.figure(figsize=(12,4))
ax1 = res_to_plot.Sparse.plot(lw=1)
ax2 = res_to_plot[0].plot(secondary_y=True,lw=2,style='--')
ax1.set_ylabel('sparse')
ax1.legend(['sparse'],loc=2)
ax2.set_ylabel('surprise')
ax2.legend(['surprise'],loc=1)
ax2.set_ylim(-0.1, 0.25)
plt.title('NVDA sparse return vs earning surprises + economic states')
One major drawback of applying a multi-factor model to explain individual stock returns is the volatility of the factor-exposure estimates. Even for a simple factor model such as CAPM, the betas are known to vary over time. This section explores the relationship between the single-stock return and the factors in a dynamic setting.
In-sample fit of the model is evaluated using $R^2$. There are other useful metrics, such as AIC and BIC. For the Kalman filter I tried to compute AIC and BIC using the package's likelihood function, but it seems buggy and there is no easy fix; therefore only $R^2$ is reported.
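Since the Kalman-filter variant is only described in prose here, a minimal hand-rolled sketch of filtering a time-varying beta (a random-walk state with a scalar observation; the noise settings `q`, `r` and the function name are my own illustrative assumptions, not the package used in the project):

```python
import numpy as np

def kalman_beta(y, x, q=1e-5, r=1e-2, beta0=0.0, p0=1.0):
    """
    Scalar Kalman filter for a time-varying beta:
        state:       beta_t = beta_{t-1} + w_t,     w_t ~ N(0, q)
        observation: y_t    = beta_t * x_t + v_t,   v_t ~ N(0, r)
    Returns the filtered beta path.
    """
    beta, p = beta0, p0
    out = np.empty(len(y))
    for t in range(len(y)):
        p = p + q                                # predict: state variance grows
        k = p*x[t] / (x[t]**2 * p + r)           # Kalman gain
        beta = beta + k*(y[t] - beta*x[t])       # measurement update
        p = (1.0 - k*x[t])*p                     # posterior variance
        out[t] = beta
    return out

# usage on synthetic data with a constant true beta of 1.5
rng = np.random.RandomState(2)
x = rng.normal(0, 1, 1000)
y = 1.5*x + rng.normal(0, 0.1, 1000)
betas = kalman_beta(y, x)
```

Unlike a rolling window, the effective memory here is set by the ratio `q/r` rather than a hard cutoff, which avoids the window-length grid search discussed below.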
(I realized that in the previous section I forgot to subtract the daily risk-free rate in the CAPM regression. Though I implemented a corrected version, the differences in results were negligible, so I leave the initial draft untouched as one piece (treating it as a single market-factor model) and focus solely on the extension.)
ff_3factor_data = pd.read_csv(PATH+'F-F_Research_Data_Factors_daily.CSV',
skiprows=4,header=0,index_col=0,parse_dates=True,skipfooter=2)
ff_mom = pd.read_csv(PATH+'F-F_Momentum_Factor_daily.CSV',
skiprows=13,header=0,index_col=0,parse_dates=True,skipfooter=2)
ff_mom.columns = ['MOM']
ff_3factor_data = ff_3factor_data.loc[start_date:end_date,:]/100
ff_mom = ff_mom.loc[start_date:end_date,:]/100
ff_hitec = ind_ret.loc[start_date:end_date,'HiTec']
all_exog = pd.concat([ff_3factor_data,ff_mom,ff_hitec,daily_log_ret_factor[['Quality','Min Vol']]],axis=1).dropna()
excess_ret = all_exog['Mkt-RF']
rf = all_exog['RF']
calculate_vif(all_exog)
Comment: as before, there exists strong multicollinearity in Market and Quality. The treatment stays the same: the two-stage regression is implemented, with the first stage being the standard CAPM and the second stage against Industry and the ex-Quality style factors.
Below is the implementation of the rolling-window regression. Apart from OLS, it has the option to run Robust Linear Models (RLM), included because OLS does not handle outliers well. Single-stock returns can occasionally be extreme, as market shocks drive prices outrageously high or low, so the ability to perform RLM is a nice extension. The robustness criterion defaults to Huber's T.
def r_squared(y,y_fitted):
"""
Calculates R-squared based on y and its fitted value
"""
ybar = np.sum(y)/len(y)
ssreg = np.sum((y_fitted-ybar)**2)
sstot = np.sum((y - ybar)**2)
return ssreg / sstot
def regression_roll(Y, X, window, add_const=False, rlm=False):
    """
    Rolling regression Y~X with look-back window 'window'.
    Returns a panel of coefficients and confidence intervals for each regression,
    the in-sample R-squared, and the list of statsmodels results objects.
    """
    date_index = Y.index[window-1:]
    if add_const:
        X = sm.add_constant(X)
    if isinstance(X, pd.DataFrame):
        columns = X.columns
    x = [X.values[i:i+window, :] for i in range(len(X)-window+1)]
    y = [Y.values[i:i+window] for i in range(len(Y)-window+1)]
    if not rlm:
        ls_r = [sm.OLS(y[i], x[i]).fit(cov_type='HC0') for i in range(len(x))]
    else:
        ls_r = [sm.RLM(y[i], x[i]).fit() for i in range(len(x))]
    b = np.asarray([ls_r[i].params for i in range(len(ls_r))])
    se = np.asarray([ls_r[i].cov_params() for i in range(len(ls_r))])
    conf_dict = {date: pd.DataFrame(ls_r[i].conf_int(),
                                    columns=['lower', 'upper'],
                                    index=columns).T for i, date in enumerate(date_index)}
    b = pd.DataFrame(b, date_index, columns)
    se = pd.Panel(se, date_index)
    conf_panel = pd.Panel.from_dict(conf_dict).swapaxes(1, 0)
    conf_panel['params'] = b
    conf_panel = conf_panel.swapaxes(2, 0)
    r_sqr = r_squared(Y.values[window-1:], np.sum(b.values * X.values[window-1:], axis=1))
    return conf_panel, r_sqr, ls_r
conf_panel_30, r_sqr_30, _ = regression_roll(Y=single_stock_ret-rf, X=sm.add_constant(excess_ret), window=30)
conf_panel_60, r_sqr_60, _ = regression_roll(Y=single_stock_ret-rf, X=sm.add_constant(excess_ret), window=60)
conf_panel_120, r_sqr_120, _ = regression_roll(Y=single_stock_ret-rf, X=sm.add_constant(excess_ret), window=120)
conf_panel_180, r_sqr_180, _ = regression_roll(Y=single_stock_ret-rf, X=sm.add_constant(excess_ret), window=180)
As will be shown below, different rolling window spans can change beta quite drastically. The trend of beta seems to stabilize once the window size exceeds 120. Naturally the question becomes what look-back window is optimal for our purpose. One would be tempted to use R-squared and select the span that maximizes it by trying a stream of values (essentially a grid search). However, this approach will certainly end up choosing a span of 2, as 2 points fit a line perfectly, resulting in an artificially high R-squared of 100%. If the sample size is too small, it defeats the purpose of regression.
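To see why such a grid search degenerates, consider a toy example (numpy-only, illustrative): with a window of 2, every "regression" passes exactly through both points, so the in-sample fit is perfect no matter how noisy the data.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 1.2 * x + rng.normal(scale=0.5, size=100)  # true slope 1.2, heavy noise

# a "rolling window" of 2: the fitted line passes through both points exactly
xi, yi = x[:2], y[:2]
beta = (yi[1] - yi[0]) / (xi[1] - xi[0])
alpha = yi[0] - beta * xi[0]
resid = yi - (alpha + beta * xi)
print(np.allclose(resid, 0))  # True: zero residual, i.e. an R-squared of 100%
```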
The beta estimates exhibit trends and offer some insight into short-term changes in market exposure. Using the 120-day rolling window as an example, we see that the beta stays above one most of the time. Since Nvidia is usually treated as a growth stock, a beta greater than 1 seems logical. It peaks above a staggering 2.5 by the end of 2017 and exhibits mean reversion after 2017. This can be explained by the strong performance of Nvidia in 2017, when its stock price more than doubled. Coming into 2018, it experienced a slowdown, which many attribute to tempered demand from cryptocurrency miners.
One concern with choosing an arbitrary rolling window is ambiguity: it has no good mathematical or economic meaning attached. The data can be overfit when the window is small, while the model fits poorly when the window gets too large. Additionally, some beta estimates become insignificant when the window is small, which could be either useful information or model noise. Making a decision on rolling window size is therefore challenging.
In this section, I will investigate the dynamics of time-varying beta estimates across different rolling window sizes and choose one to further investigate its time series properties.
fig, ax = plt.subplots(2, 2, figsize=(15, 7))
panels = [(conf_panel_30, r_sqr_30, 30), (conf_panel_60, r_sqr_60, 60),
          (conf_panel_120, r_sqr_120, 120), (conf_panel_180, r_sqr_180, 180)]
for axis, (panel, r_sqr, w) in zip(ax.flatten(), panels):
    coef = panel['Mkt-RF'].copy()
    axis.plot(coef.index, coef.params)
    axis.fill_between(coef.index, coef.lower, coef.upper, alpha=0.2)
    axis.set_title('%d-Day Rolling Window CAPM (OLS) Beta, $R^2 = %.2f$%%' % (w, r_sqr * 100))
Below is a quick examination of the time series properties of the rolling beta estimates. The sample I picked is from the 30-day rolling window. The ACF and PACF plots exhibit a high level of persistence, especially the PACF plot, where the first lag is almost 1. This appears to share some similarities with a random walk. The ADF test likewise fails to reject the presence of a unit root, so the beta series cannot be deemed stationary.
First differencing indicates that the beta estimates behave like white noise. Judging from the evidence, $\beta_{t} = \beta_{t-1} + \epsilon_t$ with $\epsilon_t \sim WN(0,\sigma^2)$ seems to be a good model to use. Indeed, the beta estimates over short windows are very volatile. It might be better to use some online learning technique to better capture the dynamics.
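As a quick numpy-only sanity check on that reading (illustrative, not the notebook's `stationarity_test`): a simulated random walk has a lag-1 autocorrelation near 1, while its first difference has one near 0.

```python
import numpy as np

rng = np.random.default_rng(7)
beta = np.cumsum(rng.normal(size=2000))  # random walk: beta_t = beta_{t-1} + eps_t

def lag1_autocorr(x):
    x = x - x.mean()
    return np.dot(x[:-1], x[1:]) / np.dot(x, x)

print(round(lag1_autocorr(beta), 2))           # close to 1: highly persistent
print(round(lag1_autocorr(np.diff(beta)), 2))  # close to 0: white noise
```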
fig2,ax2=plt.subplots(2,2,figsize=(15,7))
roll_beta = conf_panel_30['Mkt-RF'].params
first_diff_roll_beta = roll_beta.diff().dropna()
c = plot_acf(roll_beta,lags=20, ax=ax2[0,0], title='Autocorrelation of Rolling Beta')
d = plot_pacf(roll_beta,lags=20, ax=ax2[0,1], title = 'Partial Autocorrelation of Rolling Beta')
a = plot_acf(first_diff_roll_beta,lags=20, ax=ax2[1,0], title='Autocorrelation of First Difference of Rolling Beta')
b = plot_pacf(first_diff_roll_beta,lags=20, ax=ax2[1,1], title = 'Partial Autocorrelation of First Difference of Rolling Beta')
try:
    stationarity_test(roll_beta)
except ValueError:
    print '****** Caution ******'
    print 'Cannot reject the null hypothesis of a unit root at the 0.05 level: the series appears non-stationary'
    print '*********************'
stationarity_test(first_diff_roll_beta)
Single-day stock returns are more prone to outliers and extreme values. Robust linear models can be of great help here, as they handle outliers relatively well. Robustness is a valuable feature because the main purpose here is to extract systematic information and leave out the idiosyncratic noise, i.e. the extreme values and outliers. As is seen below, the beta estimates become steadier, while in terms of $R^2$ the two show little difference. For daily return modelling I would choose RLM over OLS.
However, RLM poses the challenge of tuning. There are many robust linear models available, each with its own set of parameters, and it is difficult to quantify (at least to me) which of them works best. RLM also shares the difficulty of selecting an optimal rolling window. In this section, I will show a comparison of OLS and RLM in terms of $R^2$ and beta estimates.
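To illustrate the robustness point before running the full comparison, here is a numpy-only sketch of a Huber-type iteratively reweighted least squares fit (the tuning constant 1.345 matches the default for Huber's T in statsmodels; the data are made up for the example): a single extreme observation drags the OLS slope away from the truth, while the robust fit barely moves.

```python
import numpy as np

rng = np.random.default_rng(11)
x = rng.normal(size=200)
y = 1.2 * x + rng.normal(scale=0.1, size=200)  # true beta 1.2
x[0], y[0] = 2.0, 25.0                          # one extreme idiosyncratic shock

ols_beta = np.dot(x, y) / np.dot(x, x)          # contaminated by the outlier

# Huber-type IRLS: downweight points with large standardized residuals
beta = ols_beta
for _ in range(50):
    r = y - beta * x
    s = np.median(np.abs(r)) / 0.6745           # robust (MAD-based) scale estimate
    w = np.minimum(1.0, 1.345 * s / np.maximum(np.abs(r), 1e-12))
    beta = np.dot(w * x, y) / np.dot(w * x, x)

print(round(ols_beta, 2), round(beta, 2))       # OLS is pulled upward; robust fit stays near 1.2
```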
conf_panel_30_rlm, r_sqr_30_rlm, _ = regression_roll(Y=single_stock_ret-rf, X=sm.add_constant(excess_ret), window=30, rlm=True)
conf_panel_60_rlm, r_sqr_60_rlm, _ = regression_roll(Y=single_stock_ret-rf, X=sm.add_constant(excess_ret), window=60, rlm=True)
conf_panel_120_rlm, r_sqr_120_rlm, _ = regression_roll(Y=single_stock_ret-rf, X=sm.add_constant(excess_ret), window=120, rlm=True)
conf_panel_180_rlm, r_sqr_180_rlm, _ = regression_roll(Y=single_stock_ret-rf, X=sm.add_constant(excess_ret), window=180, rlm=True)
print '************** R-squared comparison OLS vs RLM **************'
print
print '30 Day Rolling Window R-squared OLS vs RLM: %.2f%% - %.2f%%' % (r_sqr_30*100, r_sqr_30_rlm*100)
print '60 Day Rolling Window R-squared OLS vs RLM: %.2f%% - %.2f%%' % (r_sqr_60*100, r_sqr_60_rlm*100)
print '120 Day Rolling Window R-squared OLS vs RLM: %.2f%% - %.2f%%' % (r_sqr_120*100, r_sqr_120_rlm*100)
print '180 Day Rolling Window R-squared OLS vs RLM: %.2f%% - %.2f%%' % (r_sqr_180*100, r_sqr_180_rlm*100)
print
print '*************************************************************'
fig2, ax = plt.subplots(2, 2, figsize=(15, 7))
pairs = [(conf_panel_30, conf_panel_30_rlm, 30), (conf_panel_60, conf_panel_60_rlm, 60),
         (conf_panel_120, conf_panel_120_rlm, 120), (conf_panel_180, conf_panel_180_rlm, 180)]
for axis, (ols_panel, rlm_panel, w) in zip(ax.flatten(), pairs):
    axis.plot(ols_panel['Mkt-RF'].params, label='OLS')
    axis.plot(rlm_panel['Mkt-RF'].params, label='RLM')
    axis.legend(loc='lower right')
    axis.set_title('%d-Day Rolling Window CAPM Beta: OLS vs RLM' % w)
Summary of 6.2.2:
The Kalman filter is a powerful tool to model time-varying betas in regression. It is a linear state-space model that operates recursively on streams of noisy observations (in our case, the daily excess return of Nvidia) and the corresponding regressors (the daily excess market return) to produce a statistically optimal estimate of the underlying system state (i.e. the betas). Mathematically, the framework can be roughly formulated as follows:
$$ \beta_{t+1} = A_t\beta_{t} + \eta_t $$
$$ Y_{t} = X_{t}\beta_{t} + \sigma_t $$
where $\eta_t$ is the transition noise and $\sigma_t$ the observation noise ($Y$ in our case is 1-d). Both are zero-mean Gaussian, and $A_t$ is the transition matrix at time $t$. At each time step, the model does the following: a prediction step, which propagates the state estimate and its covariance through the transition equation, and an update step, which corrects the prediction with the new observation, weighted by the Kalman gain.
There are a few parameters to be set initially to run the Kalman filter: the initial state mean and covariance (good choices are the full-sample regression beta estimates and their covariance), the transition covariance, and the observation covariance; the latter two can be estimated via the EM algorithm, as in the implementation below.
The literature shows that both the Kalman filter and OLS suffer in the presence of multicollinearity. Therefore I stick with the two-stage approach.
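The predict/update cycle can be sketched in a few lines for the scalar random-walk-beta case (numpy-only and illustrative; the actual model below uses pykalman with a constant term and EM-estimated noise covariances):

```python
import numpy as np

def kalman_step(m, P, x, y, Q, R):
    """One predict/update cycle for a scalar random-walk beta.
    m, P: prior state mean/variance; x: regressor; y: observation;
    Q: transition noise variance; R: observation noise variance."""
    # predict: beta_{t+1} = beta_t + eta_t, so the mean carries over and variance grows by Q
    m_pred, P_pred = m, P + Q
    # update: weight the prediction error (y - x*m_pred) by the Kalman gain
    K = P_pred * x / (x * P_pred * x + R)
    m_new = m_pred + K * (y - x * m_pred)
    P_new = (1.0 - K * x) * P_pred
    return m_new, P_new

# demo: observations generated with a fixed true beta of 1.5
rng = np.random.default_rng(0)
m, P = 0.0, 1.0
for _ in range(500):
    x = rng.normal()
    y = 1.5 * x + rng.normal(scale=0.1)
    m, P = kalman_step(m, P, x, y, Q=1e-6, R=0.01)
print(round(m, 1))  # the state estimate converges toward the true beta of 1.5
```

With a tiny `Q` the filter behaves like recursive least squares; a larger `Q` lets the beta drift faster.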
from pykalman import KalmanFilter
import pymc3 as pm
from pymc3.distributions.timeseries import AR1
from pymc3.distributions.timeseries import GaussianRandomWalk
def kf_regression(Y, X, init_state_mean, init_state_cov, delta_r=None, add_const=True):
    """
    Apply Kalman filter as online linear regression
    Args:
        Y: endogenous variable in pandas Series or DataFrame
        X: exogenous variable in pandas Series or DataFrame
        init_state_mean: initial estimates of hidden states i.e. betas; a good choice could be the regression beta estimates
        init_state_cov: initial cov of hidden states i.e. betas; a good choice could be the regression beta cov
        delta_r: a parameter between 0 and 1 for the transition covariance that adjusts how quickly the slope and intercept change; defaults to None, in which case the EM algorithm figures out a value that maximizes the likelihood
    Returns:
        mean and cov of hidden states, as well as R-squared and residual
    """
    date_index = Y.index
    if add_const:
        X = sm.add_constant(X)
    if isinstance(X, pd.DataFrame):
        columns = X.columns
    if isinstance(init_state_mean, (pd.DataFrame, pd.Series)):
        init_state_mean = init_state_mean.values
    if isinstance(init_state_cov, pd.DataFrame):
        init_state_cov = init_state_cov.values
    dim = X.shape[1]
    obs_mat_r = np.expand_dims(np.vstack([X.values]), axis=1)
    if delta_r is None:
        kf_r = KalmanFilter(n_dim_obs=1, n_dim_state=dim,
                            initial_state_mean=list(init_state_mean),
                            initial_state_covariance=init_state_cov,
                            transition_matrices=np.eye(dim),
                            observation_matrices=obs_mat_r)
        kf_r = kf_r.em(Y.values, n_iter=10,
                       em_vars=['transition_covariance',
                                'observation_covariance'])
        state_means_r_smooth, smoothed_state_r_cov = kf_r.smooth(Y.values)
    else:
        assert 0 < delta_r < 1, 'delta_r must be between 0 and 1'
        trans_cov_r = delta_r / (1 - delta_r) * np.eye(dim)
        kf_r = KalmanFilter(n_dim_obs=1, n_dim_state=dim,
                            initial_state_mean=list(init_state_mean),
                            initial_state_covariance=init_state_cov,
                            transition_covariance=trans_cov_r,
                            transition_matrices=np.eye(dim),
                            observation_matrices=obs_mat_r)
        kf_r = kf_r.em(Y.values, n_iter=10,
                       em_vars=['observation_covariance'])
        state_means_r_smooth, smoothed_state_r_cov = kf_r.filter(Y.values)
    state_means_r_smooth = pd.DataFrame(state_means_r_smooth, columns=columns, index=date_index)
    smoothed_state_r_cov = pd.Panel(smoothed_state_r_cov, items=date_index, major_axis=columns, minor_axis=columns)
    y_fitted = (state_means_r_smooth * X).sum(axis=1)
    r_sq = r_squared(Y, y_fitted)
    resid = Y - y_fitted
    return state_means_r_smooth, smoothed_state_r_cov, r_sq, resid
init_state_mean = single_capm_res.params
init_state_cov = single_capm_res.cov_HC0
capm_r_smooth,capm_r_cov,r_sq_capm,capm_resid = \
kf_regression(single_stock_ret-rf,
excess_ret,
init_state_mean,
init_state_cov)
init_state_mean = single_style_ind_res.params
init_state_cov = single_style_ind_res.cov_HC0
style = all_exog[['HML','Min Vol','MOM','SMB','HiTec']]
style_r_smooth,style_r_cov,r_sq_style, resid_style = \
kf_regression(capm_resid, style,
init_state_mean,
init_state_cov)
As an alternative to OLS or RLM, the Kalman filter has better mathematical interpretability of its parameters. Its Bayesian online updating mechanism also makes more logical sense and is less computationally expensive. The results below show that the Kalman filter performed relatively well in terms of variance explained, with an $R^2$ of around 47% for both stages. The beta estimates have more isolated peaks and troughs, which were previously buried by the wiggliness of the rolling estimates.
A closer look suggests that some peaks and troughs seem to coincide (again) with quarterly shocks. They behave more like seasonal noise that may feed the filter with misleading information about subsequent observations. We should expect a better fit in systematic returns after removing these shocks from the raw data.
fig,ax = plt.subplots(2,1,figsize=(15,8))
capm_r_smooth.plot(ax=ax[0],#label=r'KF $\beta$ smoothed',
title = 'Kalman Filter CAPM, $R^2 = %.2f$%%' % (r_sq_capm*100),
lw=3)
ax2 = ax[0].twinx()
sparse_vs_surprise.Sparse.plot(ax=ax2,color='yellowgreen',label = 'Sparse component')
ax[0].legend()
ax[0].set_ylabel('Beta')
ax2.legend(loc='upper left')
ax2.set_ylabel('Sparse return')
style_r_smooth.plot(ax=ax[1],
title = 'Kalman Filter Style Regression, $R^2 = %.2f$%%' % (r_sq_style*100))
ax3 = ax[1].twinx()
sparse_vs_surprise.Sparse.plot(ax=ax3,color='yellowgreen',label = 'Sparse component')
ax3.legend(loc='upper left')
ax3.set_ylabel('Sparse return')
ax3.set_ylim(-0.2, 0.25)
fig.tight_layout()
Similar to the approach in the first draft of this research, Robust PCA is used to generate sparse returns. The Robust PCA is performed on the log returns of all tech stocks in the S&P 500 universe. The $\lambda$ is set to 0.1 to ensure sparsity (the higher, the sparser). The sparse return is subtracted from the raw return, and the resulting return is used as the endogenous variable fed into the Kalman filter.
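The `robustPCA` helper called below was defined in an earlier section. For reference, a minimal numpy sketch of the underlying idea, Principal Component Pursuit solved by ADMM with alternating singular-value and soft thresholding, could look like this (function names and parameter defaults are illustrative, not the notebook's actual implementation):

```python
import numpy as np

def soft_threshold(X, tau):
    """Elementwise shrinkage toward zero; promotes sparsity in S."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def sv_threshold(X, tau):
    """Shrink singular values; promotes low rank in L."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * soft_threshold(s, tau)) @ Vt

def rpca(M, lamb=None, mu=None, n_iter=500):
    """Decompose M into a low-rank L plus a sparse S (Principal Component Pursuit)."""
    if lamb is None:
        lamb = 1.0 / np.sqrt(max(M.shape))
    if mu is None:
        mu = M.size / (4.0 * np.abs(M).sum())
    S = np.zeros_like(M)
    Y = np.zeros_like(M)  # dual variable enforcing M = L + S
    for _ in range(n_iter):
        L = sv_threshold(M - S + Y / mu, 1.0 / mu)
        S = soft_threshold(M - L + Y / mu, lamb / mu)
        Y = Y + mu * (M - L - S)
    return L, S

# tiny demo: recover a rank-3 matrix corrupted by sparse spikes
rng = np.random.default_rng(1)
L0 = rng.normal(size=(60, 3)) @ rng.normal(size=(3, 40))
S0 = np.zeros((60, 40))
mask = rng.random((60, 40)) < 0.05
S0[mask] = rng.normal(scale=10.0, size=mask.sum())
L_hat, S_hat = rpca(L0 + S0)
print(np.linalg.norm(L_hat - L0) / np.linalg.norm(L0) < 0.15)  # low-rank part recovered
```

A larger `lamb` penalizes entries of `S` more heavily, making the sparse component sparser, which matches how $\lambda$ is described above.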
The results below indicate a good overall fit in the 'systematic' component, with an $R^2$ of 58% in the first stage and 43% in the second. The betas are less extreme at quarter end, especially at the second stage. This appears to be a better alternative than the bare-bone Kalman filter: more variance explained and a steadier beta.
## Robust PCA Training
if not read_from_exiting_output:
    Z, S = robustPCA(daily_log_ret_ticker.as_matrix(), lamb=0.1)
    Z = pd.DataFrame(Z, index=daily_log_ret_ticker.index, columns=daily_log_ret_ticker.columns)
    S = pd.DataFrame(S, index=daily_log_ret_ticker.index, columns=daily_log_ret_ticker.columns)
    ## pickle trained output to local
    pd.to_pickle(Z, PATH+'Z_raw_ret')
    pd.to_pickle(S, PATH+'S_raw_ret')
else:
    ## read trained pickle file from local
    Z = pd.read_pickle(PATH+'Z_raw_ret')
    S = pd.read_pickle(PATH+'S_raw_ret')
S[ticker_subset].plot(title='Sparse Return of Selected Stocks, $\lambda = 0.1$',figsize=(15,4))
plt.legend(loc='upper right', ncol=4)
## Two-stage Kalman filter with sparse component removed
capm_r_sparse_rv,_,r_sq_capm_sparse_rv,capm_resid_sparse_rv = \
kf_regression(single_stock_ret-rf-S['NVDA'],
excess_ret,
single_capm_res.params,
single_capm_res.cov_HC0)
style_r_sparse_rv,_,r_sq_style_sparse_rv, resid_style_sparse_rv = \
kf_regression(capm_resid_sparse_rv,
style,
single_style_ind_res.params,
single_style_ind_res.cov_HC0)
fig,ax = plt.subplots(2,1,figsize=(15,8))
capm_r_sparse_rv['Mkt-RF'].plot(ax=ax[0],label=r'KF $\beta$ sparse removed',
title = 'Kalman Filter CAPM, $R^2 = %.2f$%%' % (r_sq_capm_sparse_rv*100),
lw=3)
capm_r_smooth['Mkt-RF'].plot(ax=ax[0],label=r'original KF $\beta$',
style='--',
lw=2)
style_r_sparse_rv.plot(ax=ax[1],
title = 'Kalman Filter Style Regression - Sparse Removed, $R^2 = %.2f$%%' % (r_sq_style_sparse_rv*100))
ax[0].legend()
fig.tight_layout()
Comment: Removing sparse returns could be a systematic process rather than an application specific to Nvidia. Sparse returns are essentially noise in the system, resulting from individual market events of different stocks at different times. They should be studied separately, as many may not be well explained within a systematic framework.
One concern is overfitting. As the charts show, the beta estimates are not very smooth, especially in the CAPM case, even after the sparse return is removed; they are loaded with information, and making a sound judgement on the beta dynamics seems challenging. A workaround is to decrease $\lambda$ in Robust PCA so that the sparse return absorbs more 'noise'. It is not an easy trade-off to make, however, as the sparse return becomes harder to examine once it is too dense.
Apart from tuning $\lambda$, one could instead tune the transition covariance, which adjusts how quickly the slope and intercept change. This could be even trickier, as the transition covariance in the CAPM case is 2-by-2 (constant and excess return). In fact, I tried different scales of it as a diagonal matrix. The result was not great (the constant term explained most of the variation instead of the beta) and is excluded to keep the research concise.
Below I examine the time series properties of the seemingly irregular beta estimates. Using ACF-PACF plots, I show that the beta estimates from the first-stage (CAPM) Kalman filter are very persistent. I then fit the estimates with an ARIMA model. Overall the ARIMA model seems to perform decently; it identifies the persistence and mean reversion in the series, and the resulting residuals are close to white noise.
## ACF PACF plots for first stage (CAPM) beta
fig,ax=plt.subplots(1,2,figsize=(15,3))
a = plot_acf(capm_r_sparse_rv['Mkt-RF'],lags=30,ax=ax[0], title='Autocorrelation of First Stage (CAPM) Beta')
b = plot_pacf(capm_r_sparse_rv['Mkt-RF'],lags=30,ax=ax[1], title='PACF of First Stage (CAPM) Beta')
## Fitting ARIMA model to first stage (CAPM) beta
model = SARIMAX(capm_r_sparse_rv['Mkt-RF'], order=(1,1,0))
model_fit = model.fit()
print(model_fit.summary())
fig, ax3 = plt.subplots(1,2,figsize=(15,3))
plot_pacf(model_fit.resid,ax=ax3[0],lags=60)
ax3[0].set_title('Residual PACF')
plot_acf(model_fit.resid,ax=ax3[1],lags=60)
ax3[1].set_title('Residual ACF')
fig.tight_layout()
Comment: Judging from the results, the first-stage Kalman filter beta estimates can be fitted as an ARIMA(1,1,0) of the form
$$\beta_t-\beta_{t-1} \approx 0.71(\beta_{t-1}-\beta_{t-2}) + WN(0,0.002)$$
or equivalently
$$\beta_t \approx 1.71\beta_{t-1}-0.71\beta_{t-2} + WN(0,0.002)$$
The series is very persistent at the first lag, with some mean reversion at the second. This may explain the periodic humps we saw in the data. Also, there is little serial correlation in the ARIMA(1,1,0) residual, indicating good explanatory power of the model.
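As a sanity check on this reading, one can simulate an ARIMA(1,1,0) with the fitted coefficient and recover $\phi \approx 0.71$ by regressing the first difference on its own lag (numpy-only sketch; the parameters are taken from the fit above):

```python
import numpy as np

rng = np.random.default_rng(42)
n, phi = 5000, 0.71
eps = rng.normal(scale=np.sqrt(0.002), size=n)

# the first difference follows an AR(1): d_t = phi * d_{t-1} + eps_t
d = np.zeros(n)
for t in range(1, n):
    d[t] = phi * d[t - 1] + eps[t]
beta = np.cumsum(d)  # integrate once to get the ARIMA(1,1,0) level series

# estimate phi by regressing the first difference on its lag
dd = np.diff(beta)
phi_hat = np.linalg.lstsq(dd[:-1, None], dd[1:], rcond=None)[0][0]
print(round(phi_hat, 2))  # close to 0.71
```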
One may also be tempted to check the statistical significance of these beta estimates and their confidence intervals. This can be tricky, as the hidden state $\beta_t$ is not a parameter estimate of the model. It is almost a one-time estimate, akin to the transition covariance if a random walk is assumed. Using OLS as a comparison, $\hat\beta_{ols}$ is an estimate of the underlying true beta $\beta_{ols}$; it is $t$-distributed, with standard error approaching 0 as the sample grows infinitely large. To my understanding, the Kalman filter setup is entirely different. I did my best to find literature on testing the significance of hidden states, but with no luck. Significance testing remains an interesting topic that might apply to other parameters of the Kalman filter.
Frankly, the Kalman filter is an advanced topic that requires experience and knowledge to fully grasp. There may be many places where this model can be improved. It is an interesting topic which I will dig deeper into in the future, and I would very much welcome an opportunity to discuss it further.
So far we have applied the two-stage Kalman filter to the daily return series in excess of sparse returns. The model has captured a decent amount of variation, and before moving further to PCA, it is worth spending a few minutes understanding the time series properties of the residual from the two-stage process. I show that unlike the daily log return, the residual is in fact not white noise and can be fitted with an MA($q$) model.
resid_style_sparse_rv.plot(title='Two-Stage Kalman Filter Residual of NVDA', figsize=(15,4))
## ACF PACF plots for two-stage residual
fig,ax = plt.subplots(1,2,figsize=(15,3))
a = plot_acf(resid_style_sparse_rv, lags=30, ax=ax[0], title='Autocorrelation of Two-Stage Residual')
b = plot_pacf(resid_style_sparse_rv, lags=30, ax=ax[1], title='PACF of Two-Stage Residual')
Comment: The section above plots the ACF and PACF of the residual up to 30 lags. From the charts, it is reasonable to believe that the residual follows an MA($q$) process, and is therefore not white noise. The value of $q$ is not easily pinned down. I experimented with different $q$, initial values and solvers, and finally settled on $q=3$, which produced significant lag coefficients and residuals with no remaining serial correlation. The result is shown in the sections below.
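The "cut-off after lag $q$" signature that motivates the MA(3) choice can be verified on simulated data (numpy-only sketch; the coefficients roughly mirror the starting values used below):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20000
eps = rng.normal(size=n + 3)
# MA(3): x_t = eps_t + 0.19*eps_{t-1} + 0.10*eps_{t-2} + 0.10*eps_{t-3}
x = eps[3:] + 0.19 * eps[2:-1] + 0.10 * eps[1:-2] + 0.10 * eps[:-3]

def sample_acf(x, k):
    x = x - x.mean()
    return np.dot(x[:-k], x[k:]) / np.dot(x, x)

print([round(sample_acf(x, k), 2) for k in (1, 2, 3, 4)])
# lags 1-3 are clearly nonzero; lag 4 is near zero, the MA(3) cut-off
```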
## Fitting MA(q) model to two-stage residual
model = SARIMAX(resid_style_sparse_rv, order=(0,0,3))
model_fit = model.fit(start_params=np.array([0.19,0.1,0.1,5e-06]))
print(model_fit.summary())
fig, ax3 = plt.subplots(1,2,figsize=(15,3))
plot_pacf(model_fit.resid,ax=ax3[0],lags=40)
ax3[0].set_title('Residual PACF')
plot_acf(model_fit.resid,ax=ax3[1],lags=40)
ax3[1].set_title('Residual ACF')
fig.tight_layout()
Summary of 6.2.3:
The most widely known dimension reduction technique in finance is arguably PCA. It is a useful tool to extract and select statistical factors for a multi-factor model. It operates on the covariance matrix of a set of assets; the principal components associated with the largest eigenvalues are then selected as the statistical factors.
In our case, the number of daily samples ($T=755$) is much greater than the total number of tech stocks ($N=74$). The covariance estimates should be reasonably accurate, and shrinkage seems unnecessary. In practice, many would also exponentially weight the data, giving declining weights to observations as they recede into the past, with the intention of getting a more accurate covariance measure for the most recent date. However, that is not the focus of this research, so the implementation of PCA here is in its purest form.
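For reference, the exponential weighting mentioned here (not used in this research) can be sketched as follows (numpy-only and illustrative; `halflife` is an assumed parameter):

```python
import numpy as np

def ewma_cov(returns, halflife=60):
    """Exponentially weighted covariance: recent observations count more."""
    T = returns.shape[0]
    lam = 0.5 ** (1.0 / halflife)
    w = lam ** np.arange(T - 1, -1, -1)   # oldest observation gets the smallest weight
    w = w / w.sum()
    mu = w @ returns
    X = returns - mu
    return (w * X.T) @ X                   # sum_t w_t * x_t x_t'

rng = np.random.default_rng(2)
r = rng.normal(size=(755, 4)) * 0.01
C = ewma_cov(r)
print(C.shape)  # (4, 4)
```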
The real challenge is how much information remains to be extracted after the two-stage process. Put another way: how useful can PCA still be if the correlation matrix is close to identity? Subsequently, how do we select PC factors if the top few explain only a small portion of covariation? Does it still make sense to use PCA? The sections below examine these questions in more detail.
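The identity-correlation intuition can be made concrete with a quick numpy experiment: for mutually uncorrelated series of the same shape as our panel ($T=755$, $N=74$), even the top handful of principal components explains only a modest share of the variance (an illustrative baseline, not the actual residual data):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(755, 74))      # 74 mutually uncorrelated "residual" series
Xc = X - X.mean(axis=0)

# PCA via SVD: squared singular values are proportional to explained variance
s = np.linalg.svd(Xc, compute_uv=False)
explained = s**2 / np.sum(s**2)
print(round(np.cumsum(explained)[8], 2))  # share explained by the top 9 PCs: well under half
```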
The data used are the residuals from the two-stage Kalman filter of all tech stocks, with their sparse returns removed.
if not read_from_exiting_output:
    kf_style_resid = {}
    for ticker in all_tech_tickers:
        _, _, _, kf_capm_resid = \
            kf_regression(daily_log_ret_ticker[ticker]-rf-S[ticker],
                          excess_ret,
                          capm_result[ticker].params,
                          capm_result[ticker].cov_HC0)
        _, _, _, kf_style_resid[ticker] = \
            kf_regression(kf_capm_resid,
                          style,
                          style_result[ticker].params,
                          style_result[ticker].cov_HC0)
    kf_style_resid = pd.DataFrame.from_dict(kf_style_resid)
    ## pickle trained file to local
    pd.to_pickle(kf_style_resid, PATH+'kf_style_resid')
else:
    ## read trained pickle file from local
    kf_style_resid = pd.read_pickle(PATH+'kf_style_resid')
Comment: below are two heatmaps of correlations among the stocks in the tech sector: one for the raw daily log returns, the other for the residual returns after the two-stage process. The difference is huge. In the first chart the average correlation seems to be around 0.5 with little cross-sectional variation, whereas the second chart shows a relatively sparse correlation matrix. Decomposing a near-diagonal matrix into principal components does not help much: in the extreme case of a diagonal covariance matrix, the PC factors are just the raw returns themselves. If we were to regress the raw return against these PC factors, it would just be the raw return regressing against itself!
More realistically, the problem is that the top factors explain only a limited portion of covariation. Including too many factors in the model may very well overfit the data, let alone that PC factors have little economic interpretability. This is a tough decision. As I do not have enough evidence to completely give up on PCA, I will select as few PC factors as possible while ensuring a certain level of explanatory power.
fig,ax=plt.subplots(1,2,figsize=(12,4))
sns.heatmap(daily_log_ret_ticker.corr(),ax=ax[0],xticklabels=False,yticklabels=False)
ax[0].set_title('Corr Heatmap: Tech Sector Raw Return')
sns.heatmap(kf_style_resid.corr(),ax=ax[1],xticklabels=False,yticklabels=False)
ax[1].set_title('Corr Heatmap: Tech Sector 2-Stage Resid Return')
Comment: As seen below, the cumulative proportion of covariation explained increases at a very slow pace: it takes up to 9 factors to cover over 50% of the variance. Generally, it is not good practice to include more PC factors than there are economically meaningful factors in a model, though it does not hurt to include one or two for extra explanatory power. For completeness I chose the top 2 and ran a regression of Nvidia's residual against these two PC factors.
pca_sk = PCA(n_components=15).fit(kf_style_resid.values)
pca_summary = pd.DataFrame()
pca_summary['value'] = pca_sk.explained_variance_
pca_summary['Difference'] = pca_summary.value.diff(-1)
pca_summary['Proportion'] = pca_sk.explained_variance_ratio_
pca_summary['Cumulative Proportion'] = pca_summary.Proportion.cumsum()
pca_summary.index = pca_summary.index+1
pca_summary.round(6).head(10)
## PCA regression
pca_cov = multivariate.pca.PCA(kf_style_resid, standardize=False, demean=False)
for col in pca_cov.factors:
stationarity_test(pca_cov.factors[col])
a = lin_reg(Y = kf_style_resid['NVDA'], X = pca_cov.factors.iloc[:,:2],print_plots=True,print_summary=True)
Comment: the beta on the second PC is in fact not significant. For completeness I ran the regression with only the top factor and report the results below.
a = lin_reg(Y = kf_style_resid['NVDA'], X = pca_cov.factors.iloc[:,:1],print_plots=True,print_summary=True)
Comment: the top PC factor explains roughly 17% of the variation. This is still an acceptable result, though below expectations of what PCA can normally do. The regression residual shows some MA($q$) behavior; this is explained in section 6.2.3.4, as the exogenous variable can be modelled by an MA(3) process.
Guided by Future Steps in Section 5, I have explored different directions to push the research further. In earnings surprise modelling, I used the forward-looking Leading Index for the United States as a gauge of economic growth and identified months of economic expansion and slowdown. The state of expansion or slowdown could be a magnifier of earnings surprises and was added to the regression as a binary variable. At first glance the regression result looked promising; however, data with a longer history is needed to evaluate the 'magnifier' hypothesis further.
In the two-stage regression, I examined three alternatives: rolling-window OLS, rolling-window RLM (using Huber's T) and the Kalman filter as online regression. In-sample model fits were compared using $R^2$. Results showed that rolling-window OLS and RLM share a similar $R^2$, but the latter produced steadier beta estimates as it is more robust to outliers. The Kalman filter performed best, and its performance improved further after removing the sparse return from the raw daily return data: compared to the single-period regression model, the $R^2$ improved from 21% to 58% in the first stage and from 17% to 43% in the second stage.
In extracting latent factors, the popular PCA was used. However, the residuals from the two-stage process exhibited limited correlation, and as a result PCA did not perform as well as expected. The variance explained by the top factors was lacking; regressing the two-stage residual against the top PC factor led to an $R^2$ of 17%.